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The analysis of fields in periodic dielectric structures arise in numerous ap- 
plications of recent interest, ranging from photonic bandgap (PBG) structures 
and plasmonically active nanostructures to metamaterials. To achieve an accu- 
rate representation of the fields in these structures using numerical methods, 
dense spatial discretization is required. This, in turn, affects the cost of anal- 
ysis, particularly for integral equation based methods, for which traditional 
iterative methods require 0(N 2 ) operations, N being the number of spatial 
degrees of freedom. In this paper, we introduce a method for the rapid solu- 
tion of volumetric electric field integral equations used in the analysis of doubly 
periodic dielectric structures. The crux of our method is the ACE algorithm, 
which is used to evaluate the requisite potentials in O(N) cost. Results are 
provided that corroborate our claims of acceleration without compromising 
accuracy, as well as the application of our method to a number of compelling 
photonics applications. Manuscript submitted to J OS A A. 



1. Introduction 

The scattering of light from subwavelength dielectric arrays is fundamental to a number of 
contemporary problems in optics. In photonic bandgap (PBG) structures, gaps in the pho- 
tonic density of states arise due to a modification of the photonic dispersion in the vicinity 
of Bragg planes brought about by some underlying periodicity p. These structures have 
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been explored for a wide range of applications, from waveguiding structures [2] to experi- 
mental tests of cavity quantum electrodynamics [3]. Similarly, plasmonic structures exploit 
the unusual dielectric behavior of certain metals (namely, the nobles) at optical frequencies 
to focus electromagnetic energy at subwavelength scales. This focusing is typically the result 
of the excitation of surface plasmon-polaritons (SPP) modes, with dispersion relations that 
lie below the free space light-line [I]. One means of coupling light into these modes is the 
periodic modulation of the surface morphology [3] . This physical phenomenon has been ex- 
ploited for a number of applications, ranging from extraordinary transmission [6] to briding 
the gap between and electronic and optical circuitry [7J. Finally, metamaterial structures 
have been studied extensively in the hopes of finding media that can be homogenized, albeit 
at specific wavelengths and angles of incidence, in such a way that they effect the behaviors 
of a structure with a negative, near zero, or otherwise anomalous index of refraction [8]. 
While the noble metals are also often used in optical metamaterials, the associated losses 
can be too severe, leading to the development of active metamaterials, wherein dye molecules 
are used as gain-inducing dopants [9]. 

In all of the aforementioned applications, intuition and supplemental simulation have been 
successfully applied to the design and analysis of countless structures. However, this approach 
is bound to fail as designs become increasingly complex. It is the goal of computational 
electromagnetics and optics to supplant much of the trial and error associated with the 
design of such complex structures with rigorous in silico modeling. Typical systems that have 
been previously analyzed using full-wave methods are relatively small in terms of the number 
of spatial degrees of freedom, ranging from 100s-l,000s. However, a recent proliferation of 
research in PBG, SPP, and metamaterial systems with fine subwavelength features, and/or 
highly lossy/ dispersive materials lead to periodic structures wherein the unit cell requires 
10,000+ degrees of freedom to reliably represent the underlying physics. 

Integral equation (IE) methods for periodic scattering problems present a number of salient 
advantages over full-wave differential methods. IE methods give rise to solution fields that 
exactly satisfy proper boundary conditions through the use of an appropriate Green's func- 
tion, are free of numerical dispersion, and only require the discretization of scattering bodies. 
This is not to say that these methods are without their disadvantages, in particular, these 
methods give rise to dense matrices that require 0(N 2 ) resources in terms of number of 
operations and storage to achieve an iterative solution. Further, the evaluation of matrix el- 
ements by numerical quadrature is somewhat onerous, as the periodic Green's function must 
be evaluated at each pair of source and testing integration nodes. To this end, a considerable 
body of research has developed with the goal of mitigating these costs through the design 
and implementation of fast algorithms. 

Some popular fast methods include the Fast Multipole Method (FMM) [TO], the Adaptive 
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Integral Method (AIM) and numerous variations that address adaptivity, or low/high 
frequency bottlenecks. These methods have become very popular for the analysis of non- 
periodic scattering problems. While the framework for adapting FMM to periodic problems 
in statics has been around since [12], it was not until the mid-90's that FMM was adapted 
to periodic wave propagation [13J. Even so, it was used to reduce matrix fill time, rather 
than accelerating the iterative solution process- this was not realized until the work of Otani 
and Nishimura in 2008 [T3j . Aside from tree-based methods, AIM-based methods have been 
very naturally employed in the analysis of periodic problems using both conventional IE 
methods [15], as well as FE-BI [16]. Further, a number of interpolatory methods have been 
applied to these problems pT], including highly efficient GPU implementations [18]. In this 
work, we expand upon this growing body of research in fast algorithms for periodic systems, 
and provide details of the extension of the method of Accelerated Cartesian Expansions 
(ACE) a hierarchical, tree-based method similar to the FMM, to doubly periodic dielectric 
arrays. 

The ACE algorithm has previously been applied to the efficient computation of potentials 
of the form R~ v [19], Lienard-Wiechert potentials [20], diffusion, lossy wave, and Klein- 
Gordon potentials [21], and periodic Helmholtz, Yukawa, and Coulomb potentials [22]. Like 
FMM, it is based upon a hierarchical decomposition of the computational domain mapped 
onto an octree data structure, wherein a distinction between near and farfield source-observer 
aggregates is made, and an addition theorem is used to effect the interaction of all bodies 
with linear scaling. Whereas this addition theorem is based upon spherical harmonics in 
the FMM, ACE utilizes Cartesian harmonics and takes the form of a generalized Taylor 
expansion. In doing so, the salient features that develop are: 

• Totally linear scaling: in terms of both computational cost and storage. 

• Nearly kernel independent framework: only multipole-to-local operators depend upon 
the explicit form of the Green's function. 

• Exact up/down tree traversal: error is rigorously independent of tree height. 

• Amenability to non-uniform discretization: multiscale structures can be handled very 
naturally. 

• Excellent low frequency accuracy: conventional FMM for Helmholtz problems must be 
augmented for electrically dense problems (23j[23] . ACE has been shown to be very 
much complimentary to FMM in this regard [25]. 

It is this latter feature that we are particularly interested in exploiting in adapting ACE 
to periodic structures. In our primary applications of interest, the unit cells are at most 
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1 — 2A, as the technologically compelling physics arises due to energy coupled into low order 
Bloch-Floquet modes. In many applications, the unit cell may be as small as A/4 — A/10. 

The principal contributions of this work are two-fold, (i) an extension of the ACE algo- 
rithm that enables the efficient analysis of electrically dense periodic structures, and (ii) 
applications of this technology to a set of challenging problems. In presenting this work we 
will use the following layout. In Section 2, we provide a formal mathematical statement of 
the problem. In Section 3, we give details of the ACE algorithm, including the details neces- 
sary for its extension to volumetric periodic problems. Section 4 presents results that affirm 
our claims of linear scaling, convergence of our method to arbitrary accuracy, and utility in 
accelerating the analysis of a number of exemplary structures. Finally, Section 5 summarizes 
the contributions of this paper and gives a brief outline of work to come. 

2. Problem Statement 

Consider a doubly periodic array of dielectric scatterers, Qd, embedded in M 3 , characterized 
by the dielectric function e(r, u). The array is excited by an incident planewave, E inc (r, t) = 
EVM-k mc .?\ that gives rise to a polarization of the dielectric. In turn, this polarization 
field radiates a scattered field, E scat (r, t), such that the total field is given by E(r, t) = 
Ej nc (?, t) + Fj scat (r, t). Considering only linear media, all fields will be time-harmonic, and 
we henceforth suppress an implicit factor of e tuJt and all time- dependence. 

The array upon which the geometry is arranged is characterized by the following 2-lattice, 

u 

C>2 = {t m , n = m&i + n& 2 | m,n eZ} (1) 

Here, a\ and a 2 are the lattice vectors, describing the periodicity of our array. Throughout, 
we will assume a square lattice, i.e., |a*i| = |a 2 | and a\ • a 2 = 0, though we note that our 
method can be extended to more irregular lattices (i.e., rectangular or skewed) with minimal 
modification. For completeness, we define the reciprocal lattice associated with £ 2 as: 

£ 2 = {k mj „ = mbi + nb 2 m, n 6 Z} (2) 

Where a*j • hj = 27rSij, Sij being the Kronecker Delta. A simple illustration of our problem is 
provided in Fig. [TJ 

We seek to resolve the unknown, E scai (r), from which we can also compute quantities 
such as the scattering parameters, e.g., reflection, transmission, and absorption spectra, 
associated with Qd. In doing so, we utilize the volumetric equivalence principle to replace 
Q D with equivalent sources, J\/(r), radiating into a homogeneous space [26]. We relate Jy(?) 
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to the electric displacement, D(r) = e(f, w)E(r), via the following relationship: 

jV(r) = juK(r)~D(r), where k(t) = — r— - (3) 

e(r, w) 

With this substitution, any discontinuities in the normal component of Jy(r) across material 
interfaces is due to k(y) rather than D(r), which facilitates the definition of local vector basis 
functions [27] . 

— * —* 

By relating E scat {r) to Jy(r), and enforcing the identity of the total field, we arrive at the 
following Volume Integral Equation (VIE): 

E inc (f) = E(r) - E scat (r), Vr G Q D (4a) 
E mc (?) = D(f)/e(f,o;)-za; yUo J dr'g(v,?)K(?)B(?) - -^-V J dfgft r')V • (k(?)D(?)) 

(4b) 

Here, g(r, r') is the free space Green's function for the Helmholtz equation in 3D. By taking 
advantage of periodicity, we restrict our consideration to a single unit cell, Qo C M 3 , and 
replace g(r, r') with the quasiperiodic Green's function for the Helmholtz equation in 3D, 
g per (r,r'). Using g per (r,r') a radiation boundary condition is enforced receding away from 
the array, and Bloch-Floquet boundary conditions are enforced in the plane of £2, giving the 
following relationship between fields/currents in different unit cells: 

E(r + t m , n ) =e^"E(r) (5) 

Equation (j3J) can then be reformulated as: 

E inc (r) = D(r, u)/e(r, u) - J dr% er (r, r') ■ [ K (f")D(r / )] , tl* D = supply) n fi (6) 

n h 

Here, G per (r, r') is the Quasiperiodic Electric Dyadic Green's function, constructed from the 
scalar Green's function, g per (r,r'), per the formalism in [28] . 

To render a finite system of equations, Q* D is represented as a tetrahedral tesselation with 
N faces, each of which is assigned a linear vector/Schaubert-Wilton-Glisson (SWG) basis 
function [27]. Galerkin testing is subsequently applied [26], yielding a linear system of N 
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equations in N unknowns of the form: 



h = c», V^= (f M (?) , E inc (?) ) 



(7a) 
(7b) 



= (f M (r),t(r)/ £ (r,a;)) - (f M (?), / c/?G per (?,?') • ) 



(7c) 



/V 



D(r) = ^c J f l (r), VrGfi 



(7d) 



i=i 




The expansion coefficients for D(r), q, can be resolved using an iterative method [29] in 

0(N 2 ) operations. These methods require the generation of a minimal sequence of vectors in 
the range of Z^ u from which an approximate solution to Eqn. ( JTaj) can be constructed such 
that \\Z^ V I U — V^t 1 1 2 < e, where e is some designated tolerance for error. This 0(N 2 ) cost 
can be reduced via the application of fast solvers, wherein the dominant cost of the iterative 
solution process is reduced to 0(N log^ (A)) for a G [0, 1]. In the case of the ACE algorithm, 
a — 0, and O(N) scaling has been demonstrated in numerous applications. In the Section 
3, we give details of how this is achieved for the class of volumetric, doubly periodic vector 
Helmholtz problems described above. 

3. ACE Algorithm 

The crux of the ACE algorithm is the Generalized Taylor Expansion (GTE) expressed in 
the framework of Cartesian tensors: 



Here, /(?) is some smooth function, -n- indicates an n-fold tensor contraction, and the su- 
perscript (n) indicates a tensor of rank n. The premise of this equation, and of the ACE 
method, is to utilize the GTE in constructing an addition theorem for the periodic Green's 
function, wherein the source (?) and observer (?) domains are separated. As in other analyt- 
ical acceleration methods, it is this separation that enables the contruction of a hierarchical 
decomposition of the potential integrals (rendered discretely as matrix- vector products) that 
are repeatedly evaluated in the iterative solution of Eqn. (jTJ). To use this principle in achiev- 
ing a linear scaling algorithm, we need to first designate when it is that we will utilize this 
separation to effect a portion of a matrix- vector product (i.e., nearfield vs. farfield metric), 
and then construct a mathematical framework that describes the process (i.e., tree traversal) 
in which it is applied. To this end, we describe: 

• Construction of a hierarchical decomposition of Q* D and its utility in identifying 
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nearfield vs. farfield basis function pairs. 



• Application of the tree traversal process to effect a matrix- vector product. 
3. A. Tree Construction 

In the ACE algorithm, the matrix-vector product in Eqn. flTaJ is approximated as: 



Here, Z^ ar is a sparse matrix with O(N) entries which describe only interactions between 
pairs of basis functions that are in some metric near. The remaining term, T ACE is some 
composition of operators that effects the remaining farfield contribution to the total in O(N) 
cost. The distinction between nearfield and farfield is made apparent by mapping the basis 
functions subordinate to Q* D onto a hierarchical decomposition, i.e., a tree. This is accom- 
plished by embedding the unit cell inside of a cubic box and recursively subdividing it into 
increasingly smaller cubic boxes until some desired density of basis functions per box, a, is 
achieved. This entire hierarchy of boxes is stored and represented in terms of a regular octree 
data structure. Boxes are referred to in terms of their genealogy, e.g., the box subordinate 
to a box at the level directly above it is the child to the superordinate box's parent. This 
genealogy is used in constructing the nearfield/farfield dictum as follows - the basis functions 
subordinate to two boxes constitute a farfield pair if: 

• The separation between the original boxes and their nearest periodic images is, in all 
cases, at least a box length. 

• Among the parents of both the original boxes and their images, at least one pair are 
in each other's nearfield. 

The tree structure, and an exemplary application of this dictum are illustrated in Fig. 

It is important to note that the designation of nearfield and farfield does not carry the 
same physical significance as it typically does, but is only used in analogy with conventional 
free space tree-based methods. In practice, we implement this augmented rule by simply 
adding 2 auxiliary levels to our tree to account for the 8 nearest images of the unit cell, 
which are addressed but not explicitly 'filled'. As we utilize Morton ordering, the boxes lying 
in the original unit cell lie inside of a contiguous address space and we can easily distinguish 
between image boxes and the original boxes, but still utilize the same list building routines as 
in a free space code. This augmentation has a negligible effect on the computational cost, and 
can be considered a step in pre-processing as the image boxes are not explicitly traversed. 




(9) 
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3.B. Tree Traversal 

Next, we discuss the application of T ACE . As the ACE algorithm is formulated in the lan- 
guage of Cartesian tensors, it is necessary to first provide notational details. and 
are deemed the multipole and local expansions. These quantities are 3-vectors, indexed by 
/x, each component of which is a totally symmetric Cartesian tensors of rank n. A rank n 
Cartesian tensor on IR 3 is a set of 3 n quantities indexed by x, y, and z in each rank. Such 
a tensor is totally symmetric if these quantities are left invariant under permutation of the 
indices across rank, and will thus consist of n(n + l)/2 unique quantities. We denote an n-fold 
contraction between such tensors as -n-, and any vector quantity such as V or r taken as 
an n-fold tensor product with itself is indicated as or r^ n \ respectively. Further details 
concerning tensor notation, as well as proofs of the identities that will follow can be found 

in pig. 

For simplicity, we restrict ourselves to a 3-level tree, in which there will be no up/down 
tree traversal, noting that the details of a multilevel implementation can be found in [T91I25]. 
We begin by considering two disjoint subdomains of Q,q, Q s and Q Q with centroids and 
r£, deemed the source and observer domains, respectively. These subdomains are spatially 
separated in such a way that they separately fall inside leaf boxes that satisfy the farfield 
criterion described in Section 3. A The first step in tree traversal is the construction of 
ACE multipole expansion about r c s , commonly referred to in the literature as the 'charge- 
to-multipole' or C2M step. For an ACE expansion truncated beyond order P, this amounts 
to calculating the 0(P 3 ) unique entries of the following set of totally symmetric Cartesian 
tensors: 

N " (—~\ ) n 

M P = E ~T~ w ^ - ^) {n \ 0<n<P (10) 
i=i 

Here, N s is the number of quadrature points used in discretizing the contribution of the 
source integrals in Eqn. ((7j) due to basis functions in Q s , and w^i is the weight associated 
with the /ith vector component of the ith quadrature point. 

The coefficients contained in the M p tensors, combined with a knowledge of the Taylor 
coefficients of the Green's function allow us to calculate the fields at any point in r£, from a 
local expansion about r£, defined as: 

p i 

L i n) = E ^ M i m " n) ■ (m - n) • V^g per (\K ~ K\l 0<n<P (11) 

m=n 

Here, the afforementioned Taylor coefficients are contained in the set of totally symmetric 
tensors, W^g per (\r^ — ?^|),0 < m < P, referred to as the multipole-to-local (M2L) transla- 
tion operator. As the quasi-periodic Green's function is an infinite sum, an efficient means 
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of calculating the elements of the translation operator is necessary. To this end, the Ewald 
representation of the quasi-periodic scalar Helmholtz Green's function is used to generate 
these coefficients, a more detailed discussion of which can be found in [22]. We will make 
clear how the dyadic character of the Green's function used in the VIE can be accounted for 
using the scalar Green's function in what follows. 

The final step in computing the fields in Vt due to sources in f2 s , E*°(r), is termed the local- 
to-observer (L20) step. This essentially amounts to the evaluation of the last two terms on 
the right hand side of Eqn. (14bl) . These fields are evaluated using the following relationship: 

/ V( 2 )\ p 
E-(r)= (l (2) + ^r) -l-ELW.n.(?-^)W FeO, (12) 

Here, 1^ is the rank 2 identity tensor. As L^T is a constant and the source and observer 
domains are disjoint, V 1 - 2 ^ can be applied directly to the (r — tensors. From these 

fields, the testing integrals in Eqn. flj]) can be evaluated, completing the farfield contribution 
to the matrix-vector multiplication. It is worth noting, that we can alternatively evaluate 
the tested field by transferring the V*- 2 - 1 onto the source and testing basis functions prior 
to tree traversal. In this case, it is necessary to traverse an additional tree to account for 
the static contribution to the field. In practice, the difference between these two approaches 
comes down to a trade off between speed and accuracy. Transferring the derivatives typically 
results in a slightly more accurate evaluation of the field, whereas evaluating the field using 
the dyadic relation in Eqn. ffl2|) is much more efficient. Consequently, the results presented 
in this paper have been obtained using the latter strategy. 

3.C. Computational Cost 

The cost structure of this scheme is identical to the one presented in [22], with the excep- 
tion of some minor details. In [22] it is demonstrated that this periodized version of ACE 
provides O(N) evaluation of scalar potentials in terms of both FLOPs and memory. There 
are two primary differences between the scalar potentials analyzed previously and the vector 
potentials presented in this paper. First, a separate tree must be traversed for each vector 
component of the field, which will increase the number of operations and memory required 
for a tree traversal by a factor of either 3 or 4 depending upon whether or not the static con- 
tribution to the potential is evaluated using a separate tree, as previously discussed. Second, 
the quantity N is no longer the number of degrees of freedom (i.e, basis functions in this 
case), but proportional to this quantity. Instead, it is determined by the source and testing 
quadrature rules used in discretizing the necessary MoM integrals. While this doesn't affect 
the O(N) scaling of the method, it does affect the optimal number of degrees of freedom 
per leaf box used in minimizing cost. In the results presented, the average number of basis 
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functions per leaf box is ~ 20. 
4. Results 

Next, we present several results that: 

• Validate the proposed acceleration scheme and demonstrate convergence. 

• Demonstrate the applicability of our method to the analysis of a numerical of estab- 
lished results. 

• Explore the utility of the algorithm in analyzing a number of interesting photonic 



Both our ACE accelerated and reference codes were written in Fortran. They were compiled 
using the Intel Fortran Compiler with — r8 — 03 flags (double precision and optimization), 
and run in serial on an Intel Xeon E5620 at 2.4 GHz with 24 GB of RAM running Linux OS 
at the High Performance Computing Center at Michigan State University. In generating all 
results, we utilize the a diagonal-preconditioned TFQMR iterative method with a tolerance 
of < 0.1% in the L2 norm. This choice of norm is implicit throughout the remainder of 
this work. All integrals over tetrahedra are evaluated with a 5 point source, 1 point testing 
rule, whereas all integrals over triangles are evaluated with a 7 point source, 1 point testing 
rule. This is of course, excepting singular integrations, which are evaluated using an analytic 
singularity subtraction in conjunction with quadrature [30J. All periodic Green's function 
evaluations, in both Ewald and Floquet representations, are converged to an accuracy of 
< 0.001%. All infinite series in the periodic ACE translation operators are evaluated to a 
similar accuracy. In all cases, we note that we can significantly reduce the runtime of our 
algorithm by reducing the order of our ACE expansions and/or the convergence criterion 
for infinite summations. Unless otherwise indicated, we have utilized 5 levels and P = 7 to 
guarantee an error of ~ 0.01%, in line with the results presented in [22] and in this work. In 
most scattering-based applications, a lower error tolerance/order of ACE expansion can be 
used with minimal recourse to the computed farfield observables. Finally, all exciting fields 
are normally incident on the plane of periodicity, excepting the results in Fig. flSJ). 

The first two results are obtained using a 'kernel code' from which error convergence and 
scaling can be extracted independent of the framework of an integral equation solver. This 
'kernel code' evaluates the following convolution, both with and without ACE: 



structures. 




(13a) 
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N 

i=i 

Here, iV is the number of co- located point sources/observers, and Wi and denote the weight 
and position associated with the ith source, respectively. We are primarily concerned with 
the farfield contribution to this potential, i.e., for a fixed point observer, Fj, we restrict the 
integration in Eqn. ( 11 3 aft to include only contributions from sources that are in the farfield 
of the observer as defined by the ACE algorithm, independent of whether or not ACE is 
used to evaluate the convolution. We denote this partial contribution to $(r) as $/ ar (F). 

In Fig. ([3]) error convergence with the order of the ACE expansion is demonstrated using 
the 'kernel code'. Here, the error was calculated using the following formula for 9 different 
values of P: 



\ 



N 
i=l 



N 

£ | 1 1 2 



(14) 



i=l 

In the results presented, N = 1,000 point sources were distributed randomly throughout a 
unit cell, Q = [0, 1) x [0, 1) x [0, 1) (i.e., the unit cube) and the weights, Wi, were chosen 
from a uniform random distribution on [—1,1). The phasing in g per {?, ?') is characterized by 
an incident field with k inc = 27ra 1 x a 2 , i.e., it is normally incident on the periodic array and 
has A = 1. The results presented in Fig. ([3]) indicate that the ACE algorithm can be tuned 
to an arbitrary level of accuracy by increasing the order at which the associated expansions 
are truncated. We have previously demonstrated the effect of changing incidence angle and 
wavelength on error for a fixed value of P [22], and note here, that it has been found that 
convergence is largely unaffected by incidence angle, but not wavelength. Relative to the 
results presented, as wavelength decreases, convergence in P is slower, whereas it is typically 
faster as wavelength is increased. The conditions for this numerical experiment were chosen to 
demonstrate convergence in something of a worst case scenario for the applications at hand 
- as most of the technologically compelling physics for PBGs, plasmonic nanostructures, 
and metamaterials arise when the unit cell is subwavelength. It is worth noting that €f ar 
is evaluated relative to $j ar (r), rather than $(r). As the nearfield contribution to the $(r) 
is evaluated exactly, the error with respect to the total potential is typically an order of 
magnitude smaller. 

The second set of 'kernel code' results are given in Fig. HJ Here, scaling of the precompu- 
tation and tree traversal times with the number of points sources N are presented. Timings 
are presented for 5 different values of P, demonstrating that linear scaling can be achieved 
at arbitrary precision. In generating the results in this Figure, N sources are distributed over 
a unit cell with |a*x| = |a 2 | = 1, with a maximum out-of-plane dimension of 1, and k = 2n. 
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The number of levels is increased at each value of N, from 3 to 6, maintaining an average 
density of unknowns per leaf box of 80. The precomputation timings in Fig. 4(a) include 
the time required for tree construction as well as the calculation of all unique translation 
operators that will be later applied during tree traversal. This is a one time cost, and figures 
are provided to demonstrate two things: i) precomputation scales weakly with the number 
of sources/number of levels and ii) precomputation is negligible on the time scale required 
for solution, as will be evident from subsequent results. The tree traversal timings in Fig. 



4(b) include the time required for all 5 steps of the tree traversal project. These results are 



intended to demonstrate that linear scaling is achieved for varying levels of accuracy (i.e., 
different values of P). A linear regression yields a scaling of N im for all values of P except 
for P = 1, in which the scaling is N im . 

Our first result illustrating integration with a VIE solver is given in Fig. [51 In it, we 
demonstrate the accuracy of our method in reproducing power transmission through an in- 
finite dielectric slab of width 20 nm as the angle of incidence and polarization is varied at a 
fixed wavelength, A = 400 nm. A similar result is presented in [31], wherein a layered-medium 
formulation is compared against an analytic calculation. Here, we utilize a periodic volumet- 
ric formulation wherein the unit cell has an edge length of 80 nm. As is evident, our ACE 
accelerated code reliably reproduces the analytic solution at both TE and TM polarization, 
at arbitrarily shallow incidence, for both positive and negative relative permittivities. 

The results in Fig. [6] and Fig. [7J are intended to illustrate i) agreement of our code with 
established results from the literature and ii) acceleration of our code relative to our own 
unaccelerated reference code. The structure in Fig.[6]is an electromagnetic bandgap structure 
with applications in microwave engineering. The reference data is taken from [32], wherein a 
hybrid finite element-boundary integral formulation is utilized. Our ACE accelerated code is 
in good agreement with the established result, and we report a 37x speedup relative to our 
unaccelerated code. The result in Fig. [7J is taken from [33] wherein an analytic calculation is 
performed using Mie theory. We again find good agreement between ACE and the established 
result and report a 46 x speedup relative to our reference code. This structure has been used 
elsewhere in the literature as validation of a periodic FMM code, wherein the authors report 
4—18 minutes per solution on an 8 processor platform [T3]. As a surface integral equation 
formulation is utilized, as well as parallelism, it is difficult to draw a direct comparison 
between our results. Taking these differences into consideration, it is safe to at least claim 
that our timings are competitive with the state of the art. 

Our next result is intended to demonstrate the utility of the ACE algorithm in studying 
models for novel experimental applications. It has been understood for many years that a 
number of the optical effects common in the wings of butterflies, such as their deep coloring 
or iridescence, arise due to periodic nanostructures that occur naturally in the scales covering 
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these wings, on top of any chemical coloring (i.e. pigment) that may exist [34T436] . The blue 
coloring of the Morpho butterfly is partially due to a photonic bandgap (PBG) in the blue 
portion of the visible spectrum. This PBG is supported by a periodic ribbing in the material 
covering the butterfly's wings on the scale of visible wavelengths [51] . Previously, Huang, et 
al. [37] have performed an extensive analysis of the optical properties of not just naturally 
occuring wings, but coated and synthetic wings made from Al 2 0"3. We have constructed a 
simple model to replicate the blue structural coloring of these types of structures, albeit one 
that is both thinner and oriented differently than the one in [37]. The reflectance of this 
structure, as well as its dimensions are given in Fig. |HJ Dielectric constants for AI2O3 were 
interpolated from [38] and the resultant mesh has N = 10, 782 unknowns. It is worth noting 
that we have artificially enlarged the unit cell for this problem to ensure that it is cubic. This 
both simplifies tree construction and increases the number of degrees of freedom for demon- 
strative purpose, and is not, in principal, necessary. In this result, the unaccelerated time to 
solution was extrapolated based upon the nearfield matrix fill time for ACE alone, neglecting 
the time for iterative solution, so the speedup factor of 46 x is actually representative of a 
lower bound on the acceleration. 

Our final result is a demonstration of capability in solving a very densely discretized 
structure with a highly dispersive dielectric response. For inspiration, we turn to [39J wherein 
a yellow-light fishnet metamaterial made of layered AI2O3 and silver on an ITO substrate is 
simulated and fabricated. Here, we model a structure with the same periodicity and nanostrip 
widths but without tapering and an unmodified dielectric function for silver. While we utilize 
the well-established Johnson and Christy permittivity for silver [ID], the authors of [39] 
increase the collision rate to account for surface roughness, grain boundaries, and size effects, 
thus we do not compare our results directly. Our resultant mesh has N = 147, 374 unknowns, 
and was solved using ACE with 6 levels and P = 5. TFQMR was applied with diagonal 
preconditioning to arrive at an error of < 3% with an average of 188 iterations per frequency 
at 20 frequencies. The average time per iteration was 3.18 minutes, and the average total 
time to solution was 896 minutes. Extrapolating our nearfield matrix fill time indicates that 
the matrix fill process alone would require 256, 500 minutes using our reference code with 
the same parameters, yielding a lower bound on the speedup of ~ 286 x. 

5. Conclusion 

We have demonstrated the extension of the ACE algorithm to the efficient analysis of vol- 
umetric integral equation-based scattering problems on periodic domains. Results were pre- 
sented that indicate convergence of the kernel to arbitrary precision, as well as linear scaling 
in the evaluation of matrix- vector products. Corroborating results affirming the utility of our 
method in solving problems from the literature were also provided, as well as a number of 
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large problems that demonstrate capability in solving problems posed on densely discretized 
geometries pertinent to a number of subfields of the optics community. Work is presently un- 
der way in adapting our algorithm to both MPI and CUDA parallelism, surface formulations 
for dielectrics using the Generalized Method of Moments |JT] , and time domain problems. 
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Fig. 1. Illustration of the periodic scattering problem described in Section 2. 
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Fig. 2. Top-down view of the geometry illustrated in Fig. [T]with a 4 level octree 
structure superimposed. Interaction lists are indicated for the dark blue box. 
Boxes highlighted in blue are in the nearfield, whereas light blue boxes are in 
the farfield. The interaction between sources in the dark blue box and boxes 
highlighted in red are effected at a higher level. 
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(a) Scaling of precomputation time with the number of point sources for expansions truncated at different 
values of P 
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(b) Scaling of tree traversal time with the number of point sources for expansions truncated at different 
values of P. 



Fig. 4. Timing results for ACE 'kernel code'. 
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Fig. 5. Validation of our ACE accelerated code against an analytic solution for 
scattering from a homogeneous dielectric slab of width 20 nm with e r = ±4. 
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Fig. 6. Validation against scattering from an electromagnetic bandgap (EBG) 
structure solved using FE-BI in [32]. Discretization has N = 6, 030 unknowns. 
Average time to solution (without acceleration): ~ 813 minutes. Average time 
to solution (ACE acceleration): ~ 23 minutes. Total speedup: ~ 37x. 
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Fig. 7. Validation against scattering from an array of polystyrene spheres (e r = 
2.56) solved analytically in f33|. Discretization has N = 7,328. Average time 
to solution (without acceleration): ~ 2006 minutes. Average time to solution 
(ACE acceleration): ~ 43 minutes. Total speedup: ~ 46 x. 
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Fig. 8. (Top) Calculated reflectivity of a single model butterfly scale. (Bottom) 
Scale geometry. A single unit cell with |ai| = |a2 1 = 320 nm is outlined in red. 
The height of the structure out of plane is 280 nm and the diameter of the 
larger cylinders is 130nm, with a center-center spacing of 160 nm between near- 
est neighbors. The smaller cylinders of diameter 20 nm with axes in the plane 
of periodicty are oriented along the polarization vector of the incident field, 
with a center-center spacing of 70 nm out of the plane of periodicity. The re- 
sultant mesh has iV = 10, 782. Average time to solution (without acceleration, 
extrapolated): ~ 1430 minutes. Average time to solution (ACE acceleration): 
~ 31 minutes. Total speedup: ~ 46 x. 
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Fig. 9. Demonstration of capability in solving a large scattering problem (N = 
147,374) inspired by a metamaterial design presented in j39j. An average of 
188 iterations per frequency was required, with an average time per iteration 
of 3.18 minutes, and an average total solution time of 896 minutes. 
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